date()
## [1] "Mon Dec 4 16:59:10 2023"
Hello! It's 6:11 on Thursday, November 2nd. I expect this course to be a refresher on topics I am already familiar with. I found this course on Sisu. I'm looking forward to working with new data, new visualization techniques, and some multivariate methods. I'm also interested in dealing with missing data.
This is an Introduction to Open Data Science course where we will look at visualizing and analyzing open data with multivariate statistical methods, following practices that encourage open, reproducible science. Link to my GitHub: https://github.com/kris-nader/IODS-project
Summary of the week (completed work and learning)
Chapter 2 consists of 2 parts:
- Data Wrangling. In this section, we use survey data and create an analysis CSV file using an R script (./data/create_learning2014.R).
- Data Analysis. In this section, we use the learning2014.csv file to do some exploratory analysis. This includes plots, linear regression, and diagnostic plots from our final model.
About the data: this data is from a survey studying the relationship between learning approaches and student achievement in an introductory statistics course in Finland, which took place from 03-12-2014 to 10-01-2015.
Before processing/data wrangling it is a 183x60 dataset: 56 Likert-scale variables on a scale of 1-5, 3 continuous variables (age, attitude, points), and 1 factor variable, gender (male/female).
We then created an analysis dataset for this chapter by running the ./data/create_learning2014.R script, which writes the learning2014.csv dataset into the same data directory. In this script, we create new variables based on the metadata hosted here: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt and exclude observations whose points variable equals 0.
After filtering, we have 7 variables and 166 observations:
In addition, a pre-made version of the dataset is available at https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt.
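The wrangling script itself is not reproduced here; below is a self-contained toy sketch of its two main steps. The question groupings and toy values are illustrative only, the real groupings come from the metadata linked above.

```r
library(dplyr)

# toy stand-in for the raw survey data (the real file is 183 x 60)
lrn14 <- data.frame(
  gender = c("F", "M", "F"),
  Age = c(21, 25, 30),
  attitude = c(3.7, 2.5, 3.1),
  D03 = c(4, 3, 5), D11 = c(3, 3, 4),   # illustrative "deep" items
  ST01 = c(2, 4, 3), ST09 = c(3, 4, 2), # illustrative "strategic" items
  SU02 = c(2, 3, 1), SU10 = c(3, 2, 2), # illustrative "surface" items
  Points = c(25, 0, 22)
)

# step 1: average each group of Likert items into one score per learning approach
lrn14$deep <- rowMeans(lrn14[, c("D03", "D11")])
lrn14$stra <- rowMeans(lrn14[, c("ST01", "ST09")])
lrn14$surf <- rowMeans(lrn14[, c("SU02", "SU10")])

# step 2: keep the analysis variables and drop students with zero exam points
learning2014 <- lrn14 %>%
  select(gender, Age, attitude, deep, stra, surf, Points) %>%
  filter(Points > 0)

nrow(learning2014)  # 2: the zero-point student is dropped
```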
We can use the following commands to create the dataset, assuming the original data is in the data directory.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
#source("./data/create_learning2014.R")
After some data wrangling, we have 7 variables: Age, attitude, Points, gender, deep, stra, and surf. The gender variable is a character (M/F) and the others are doubles.
data=data.frame(read_csv('./data/learning2014.csv'))
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): Age, attitude, Points, deep, stra, surf
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : num 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ Points : num 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
head(data)
## Age attitude Points gender deep stra surf
## 1 53 3.7 25 F 3.583333 3.375 2.583333
## 2 55 3.1 12 M 2.916667 2.750 3.166667
## 3 49 2.5 24 F 3.500000 3.625 2.250000
## 4 53 3.5 10 M 3.500000 3.125 2.250000
## 5 49 3.7 22 M 3.666667 3.625 2.833333
## 6 38 3.8 21 F 4.750000 3.625 2.416667
summary(data)
## Age attitude Points gender
## Min. :17.00 Min. :1.400 Min. : 7.00 Length:166
## 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:19.00 Class :character
## Median :22.00 Median :3.200 Median :23.00 Mode :character
## Mean :25.51 Mean :3.143 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:27.75
## Max. :55.00 Max. :5.000 Max. :33.00
## deep stra surf
## Min. :1.583 Min. :1.250 Min. :1.583
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417
## Median :3.667 Median :3.188 Median :2.833
## Mean :3.680 Mean :3.121 Mean :2.787
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :4.917 Max. :5.000 Max. :4.333
We can do some exploratory analysis: attitude and Points appear to have a positive linear relationship. The relationships between the other variables and Points are much less obvious.
pairs(data[,-4])
There is also an imbalance in the number of males and females in the survey.
barplot(table(data$gender), main="Frequency of F and M in the survey",xlab="gender", ylab="Frequency",col="lightpink")
There's also a wide age range in this survey, from 17 to 55, with 21 being the most frequent age.
barplot(table(data$Age), main="age in the survey",xlab="age", ylab="Frequency",col="lightgreen")
There is a strong positive linear relationship between attitude and Points, and weaker relationships between the other variables and Points (some negative, shown as red circles).
M = cor(data[,-4])
corrplot::corrplot(M)
Now, we can build a linear regression model with some interesting variables. I have chosen attitude, stra, and surf based on the correlation plot above. Based on the summary, only attitude seems to be an important predictor: its p value comes from the t statistic of 5.91, which is the estimate 3.3952 divided by the standard error 0.5741. The other variables are not significant in terms of p values; for example, surf has a p value of 0.47, which is clearly insignificant, so we will remove it and refit the model. In this lm1 model, the adjusted R2 is 0.193, which means these variables explain only about 19% of the variability in Points, which is not very good.
The coefficient of attitude can be interpreted as follows: an increase of 1 unit in attitude results in an increase of 3.39 exam points if the other variables are held constant.
lm1=lm(Points ~ stra + attitude + surf, data = data)
summary(lm1)
##
## Call:
## lm(formula = Points ~ stra + attitude + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## stra 0.8531 0.5416 1.575 0.11716
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
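This interpretation can be verified mechanically with predict(): two observations that differ only in attitude by one unit differ in predicted points by exactly the attitude coefficient. A self-contained toy illustration (random data, not the course data):

```r
set.seed(42)
toy <- data.frame(Points = rnorm(50, 22, 5),
                  attitude = runif(50, 1, 5),
                  stra = runif(50, 1, 5),
                  surf = runif(50, 1, 5))
fit <- lm(Points ~ stra + attitude + surf, data = toy)

# two identical students except for a one-unit difference in attitude
newdata <- data.frame(attitude = c(3, 4), stra = 3, surf = 3)
diff(predict(fit, newdata))  # exactly coef(fit)["attitude"]
```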
Because surf had the highest (worst) p value, we remove it first, and we see that stra is now borderline significant. The adjusted R2 of 0.195 is only slightly better (about 20% of the variability in Points is explained by this model). We could remove stra next, but lm2 is my final model. In this model, there is a strong relationship between attitude and Points and a small, borderline-significant relationship with stra, but only when both are in the model (we would need to check stra independently). A 1 unit increase in attitude results in a 3.47 unit increase in Points when stra is held constant, and a 1 unit increase in stra results in a 0.91 unit increase in Points when attitude is held constant.
lm2=lm(Points ~ stra + attitude , data = data)
summary(lm2)
##
## Call:
## lm(formula = Points ~ stra + attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
If we also remove stra (keeping only clearly significant predictors), the adjusted R2 drops to 0.186, so we are better off keeping it. With attitude alone, only about 19% of the variability in Points is explained.
lm3=lm(Points ~ attitude , data = data)
summary(lm3)
##
## Call:
## lm(formula = Points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Based on the diagnostic plots, we can judge whether our model is good.
1. Residuals vs Fitted: we expect randomly scattered residuals, which means the model captures the overall trend in the data. If we see a shape (for example a parabola), the linear model does not capture the data well. The fit is good if the red line (showing the local mean of the residuals) stays close to the horizontal gray dotted line.
2. Q-Q plot: the points should lie on the diagonal, which means the residuals are normally distributed and the p values and standard errors are reliable. Small deviations from the line are not a problem.
3. Standardized residuals vs Leverage: no point should fall outside a Cook's distance of 1; otherwise those would be high-influence points that could affect the slope, making the estimates of the lm model unreliable.
par(mfrow=c(2,2))
plot(lm2)
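The Cook's distances behind the third plot can also be computed directly with cooks.distance(); a self-contained toy example with one deliberately influential point:

```r
set.seed(1)
# ten points near the line y = 2x, plus one far-off, high-leverage point
x <- c(1:10, 30)
y <- c(2 * (1:10) + rnorm(10, sd = 0.1), 0)
fit <- lm(y ~ x)

cooks <- cooks.distance(fit)
which.max(cooks)  # observation 11 has by far the largest Cook's distance
```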
library(ggplot2)
library(ggpubr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(caret)
## Loading required package: lattice
library(readr)
library(tidyr)
library(dplyr)
Welcome to this week's learning exercise, where we look into cross-validation and logistic regression!
This week's data come from a questionnaire on student performance, with an emphasis on the effect of alcohol. There are 2 parts to this analysis:
Data Wrangling: join 2 datasets together to form alko_data.csv.
Data Analysis: study the relationship between high/low alcohol consumption and other variables.
We joined 2 datasets together to form the basis for this analysis. Please see the R script ./data/create_alc.R for more information.
alko=read_csv("./data/alko_data.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(alko)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities: chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ alko_use : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. alko_use = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
From this we can see the structure of the data with 370 rows and 35 variables. These variables are printed below.
colnames(alko)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alko_use" "high_use"
The original source of the data is here: https://www.archive.ics.uci.edu/dataset/320/student+performance.
The data were collected from two Portuguese secondary schools and merged from 2 datasets. The variables include (I will not list them all):
G1, G2, G3 - student grades.
guardian - student's guardian (nominal: 'mother', 'father' or 'other').
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high).
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high).
absences - number of school absences (numeric: from 0 to 93).
The two original datasets consist of data from 2 courses, Math (mat) and Portuguese (por), which were merged into one. Originally there were 33 variables, but we construct 2 more:
alko_use: the average of the answers related to weekday and weekend alcohol consumption.
high_use: logical, TRUE for students whose alko_use is greater than 2 (and FALSE otherwise).
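A minimal, self-contained sketch of how these two derived variables are computed (toy rows; the real data has 370 students):

```r
alko_demo <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
alko_demo$alko_use <- (alko_demo$Dalc + alko_demo$Walc) / 2  # mean of workday and weekend use
alko_demo$high_use <- alko_demo$alko_use > 2                 # TRUE above the cutoff of 2
alko_demo$high_use  # FALSE TRUE TRUE
```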
#alko1=read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv")
#dim(alko1) ## sanity check that the datasets are the same and the data wrangling was successful
The purpose now is to study the relationships between high/low alcohol consumption and some of the other variables in the data. We choose 4 interesting variables and state a hypothesis about each one's relationship with high_use:
age: I expect more heavy drinkers among older students.
goout: influence and peer pressure are heavily associated with drinking in teenagers.
absences: more school absences with high alcohol use (hangover recovery).
G3: higher grades for those who do not drink.
a=ggplot(alko, aes(x = as.factor(high_use), y = G3))+geom_boxplot() + ylab("grade")+theme_classic()+ggtitle("Grades and High Use")
b=ggplot(alko, aes(x = high_use, y = absences))+geom_boxplot() + ylab("absences")+theme_classic()+ggtitle("Absences and High Use")
c=ggplot(alko, aes(x = high_use, y = age))+geom_boxplot() + ylab("age")+theme_classic()+
ggtitle("Age and High Use")
d=ggplot(alko, aes(x = high_use, y = goout))+geom_boxplot() + ylab("goout")+theme_classic()+
ggtitle("Going Out and High Use")
ggarrange(a, b, c, d,labels = c("A", "B", "C", "D"),ncol = 2, nrow = 2)
From these plots we can see that students who are not heavy drinkers have higher grades, fewer absences, are younger, and do not go out as much (sadly). We could also look at cross-tabulations, but I do not find them useful as a visual aid; too many numbers.
tabyl(alko, famrel, alko_use, high_use)
## $`FALSE`
## famrel 1 1.5 2 2.5 3 3.5 4 4.5 5
## 1 2 2 2 0 0 0 0 0 0
## 2 6 1 2 0 0 0 0 0 0
## 3 17 12 10 0 0 0 0 0 0
## 4 70 27 31 0 0 0 0 0 0
## 5 45 21 11 0 0 0 0 0 0
##
## $`TRUE`
## famrel 1 1.5 2 2.5 3 3.5 4 4.5 5
## 1 0 0 0 0 0 0 1 0 1
## 2 0 0 0 4 2 2 0 0 1
## 3 0 0 0 12 9 2 1 0 1
## 4 0 0 0 19 14 8 5 2 4
## 5 0 0 0 6 7 5 2 1 2
I will now subset the data so it is easier to work with (so I can use the `.` in the model formula).
alko_use_select=select(alko, absences, G3, age, goout, high_use)
We model with a logistic regression (because we have a logical outcome). From the summary we can see that absences and goout are the significant variables.
glm.1=glm(high_use ~., data = alko_use_select, family = "binomial")
summary(glm.1)
##
## Call:
## glm(formula = high_use ~ ., family = "binomial", data = alko_use_select)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.77834 1.96186 -1.926 0.05412 .
## absences 0.07315 0.02237 3.270 0.00108 **
## G3 -0.04472 0.03923 -1.140 0.25435
## age 0.04563 0.11022 0.414 0.67890
## goout 0.70668 0.11917 5.930 3.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.67 on 365 degrees of freedom
## AIC: 398.67
##
## Number of Fisher Scoring iterations: 4
The model coefficients
coef(glm.1)
## (Intercept) absences G3 age goout
## -3.77834234 0.07314823 -0.04471930 0.04562870 0.70667982
Calculate the odds ratios and confidence intervals by exponentiating the coefficients:
For a one unit increase in 'absences', the odds of high alcohol consumption are multiplied by 1.08. This is in line with my hypothesis that high alcohol consumption and absences are related.
For a one unit increase in 'goout', the odds are multiplied by 2.03. Once again, this is consistent with my hypothesis that going out with friends more often is associated with more alcohol consumption.
For a one unit increase in 'G3', the odds are multiplied by 0.96, i.e. they decrease slightly. This points in the direction of my hypothesis, but the effect is weak and its confidence interval includes 1.
Lastly, for a one unit increase in 'age', the odds are multiplied by 1.05. Consistent with my hypothesis, but this confidence interval also includes 1.
LOR=coef(glm.1) %>% exp
CI=confint(glm.1) %>% exp
## Waiting for profiling to be done...
cbind(LOR, CI)
## LOR 2.5 % 97.5 %
## (Intercept) 0.02286056 0.0004637374 1.033922
## absences 1.07589001 1.0312932049 1.127070
## G3 0.95626587 0.8852281993 1.032878
## age 1.04668570 0.8430071337 1.299954
## goout 2.02724925 1.6139414455 2.577757
To evaluate the model, we can look at the deviance
with(glm.1, null.deviance - deviance)
## [1] 63.3708
with(glm.1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 5.669695e-13
We have a significant chi-squared statistic, so this model is better than a null model.
We can also look at a pseudo R2: McFadden's value of 0.14 means only a small part of the deviance is explained by this model, which is not amazing. Note, however, that pseudo R2 values cannot be interpreted exactly like the R2 of a linear model.
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(glm.1)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -194.3343924 -226.0197918 63.3707987 0.1401886 0.1574080 0.2231852
Using only the significant variables (absences and goout) gives an accuracy of 0.76.
alko_use_select2=select(alko_use_select, high_use, absences, goout)
glm2=glm(high_use ~ absences + goout, data = alko_use_select2, family = "binomial")
summary(glm2)
##
## Call:
## glm(formula = high_use ~ absences + goout, family = "binomial",
## data = alko_use_select2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.63280 0.43792 -8.296 < 2e-16 ***
## absences 0.07632 0.02225 3.430 0.000602 ***
## goout 0.73380 0.11786 6.226 4.78e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 390.30 on 367 degrees of freedom
## AIC: 396.3
##
## Number of Fisher Scoring iterations: 4
alko_use_select2$predicted_values=predict(glm2, newdata = alko_use_select2, type = "response")
alko_use_select2$predicted_values=alko_use_select2$predicted_values > 0.5
cm=confusionMatrix(as.factor(alko_use_select2$predicted_values), as.factor(alko_use_select2$high_use), mode = "everything")
plt <- as.data.frame(cm$table)
plt$Prediction <- factor(plt$Prediction, levels=rev(levels(plt$Prediction)))
ggplot(plt, aes(Prediction,Reference, fill= Freq)) +
geom_tile() + geom_text(aes(label=Freq)) +
scale_fill_gradient(low="white", high="#009194")
## (46+236)/370=0.76
Random guessing gives an accuracy of about 0.49, which is clearly worse.
set.seed(1)
high_use_random=ifelse(rbinom(370, 1, 0.5)==1,"TRUE","FALSE")
table(high_use_random, alko_use_select2$high_use)
##
## high_use_random FALSE TRUE
## FALSE 134 65
## TRUE 125 46
##(134+46)/370
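The accuracy of the random guess can be recomputed from the table above with base R (counts copied from the printed table):

```r
# rows: random guess, columns: true high_use label
tab <- matrix(c(134, 125, 65, 46), nrow = 2,
              dimnames = list(guess = c("FALSE", "TRUE"),
                              truth = c("FALSE", "TRUE")))
sum(diag(tab)) / sum(tab)  # (134 + 46) / 370, about 0.49
```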
Then we can start generalizing with 10-fold cross-validation.
train_=trainControl(method = "cv", number = 10)
cv =train(factor(high_use) ~.,
data = alko_use_select2,
trControl = train_,
method = "glm",
family=binomial())
cv
## Generalized Linear Model
##
## 370 samples
## 3 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 334, 333, 332, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7622768 0.3596589
An accuracy of 0.762 corresponds to an error of 0.238, which is lower than the error given in the exercise set (0.26), but we can try other models just for fun! For example, the model below (age, famrel, goout, health) has an accuracy of about 0.746 and an error of about 0.254, close to what the exercise set gives us (the exact numbers vary between runs because the folds are chosen at random).
cv1 =train(factor(high_use) ~ age + famrel + goout + health, data = alko,
trControl = train_,
method = "glm",
family=binomial())
cv1
## Generalized Linear Model
##
## 370 samples
## 4 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 333, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7459459 0.3132404
Let's try more models: we start with a full model. This uses 34 predictors, so we already expect overfitting; worse, alko_use (from which high_use is computed) is among the predictors, so the fit will be too good to be true.
cvfull=train(factor(high_use) ~ .,
data = alko,
trControl = train_,
method = "glm",
family=binomial())
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvfull
## Generalized Linear Model
##
## 370 samples
## 34 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 334, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 1 1
And that's exactly what we get. The output warns that glm did not converge, so the results are misleading: an accuracy of 100% is suspicious and I would not trust it. We can try another model with half of the predictors (17 chosen at random).
index=sample(34,17,replace=FALSE)
cvhalf=train(factor(high_use) ~ .,
data = alko[,c(index,35)],
trControl = train_,
method = "glm",
family=binomial())
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvhalf
## Generalized Linear Model
##
## 370 samples
## 17 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 334, 333, 333, ...
## Resampling results:
##
## Accuracy Kappa
## 1 1
Again we get the same results. We can look into stepwise selection to see which variables come up as most important (feature selection). It turns out only 2 variables matter, Dalc and Walc, so we can try those out. This is expected, since high_use is based on alko_use, which was calculated from exactly these variables:
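The stepwise selection itself is not shown above; it could be run with MASS::stepAIC on a logistic model (on the alko data, alko_use would have to be excluded first, since high_use is defined from it). A self-contained toy sketch of the mechanics, with one informative and one noise predictor:

```r
library(MASS)

set.seed(3)
toy <- data.frame(x_signal = rnorm(200), x_noise = rnorm(200))
toy$y <- rbinom(200, 1, plogis(2 * toy$x_signal))  # outcome depends only on x_signal

full <- glm(y ~ x_signal + x_noise, data = toy, family = "binomial")
step_model <- stepAIC(full, direction = "both", trace = FALSE)
formula(step_model)  # typically keeps only x_signal
```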
alko$alko_use = (alko$Dalc + alko$Walc)/2
alko$high_use = ifelse(alko$alko_use > 2, TRUE, FALSE)
But when we use these 2, the model again fits perfectly, since high_use is fully determined by them.
cvstwo=train(factor(high_use) ~ Walc+Dalc,
data = alko,
trControl = train_,
method = "glm",
family=binomial())
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cvstwo
## Generalized Linear Model
##
## 370 samples
## 2 predictor
## 2 classes: 'FALSE', 'TRUE'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 333, 333, 333, 334, 333, 332, ...
## Resampling results:
##
## Accuracy Kappa
## 1 1
Finally, we can wrap the cross-validation in a function and vary the number of predictors:
cv_logistic = function(num_predictors) {
  if (num_predictors > 34) {
    stop("at most 34 predictors are available") # break is not valid inside a function
  }
  predictors = names(alko)[1:num_predictors]
  form = as.formula(paste("as.factor(high_use) ~", paste(predictors, collapse = " + ")))
  model = train(form,
                data = alko,
                trControl = train_,
                method = "glm",
                family = binomial())
  return(model$results$Accuracy)
}
num_predictors_range =seq(ncol(alko) - 1, 1, by = -1)
cv_results=sapply(num_predictors_range, cv_logistic)
plot(num_predictors_range,1-cv_results, type = 'b', col = 'blue', pch = 16, xlab = 'Number of Predictors', ylab = 'Error', main = 'Cross-Validation Performance')
set.seed(199)
library(MASS)
library(ggplot2)
library(GGally)
library(corrplot)
library(tidyr)
library(ggord)
library(cluster)
library(factoextra)
To get started, we load the Boston dataset from the MASS package and get a glimpse of the data. This dataset consists of information on 506 suburbs of Boston and has 14 variables. I will list a few:
crim : per capita crime rate by town.
zn : proportion of residential land zoned for lots over 25,000 sq.ft.
indus : proportion of non-retail business acres per town.
chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
rm : average number of rooms per dwelling.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
We can look at the pairwise relationships with ggpairs. rm seems to be the only variable that is approximately normally distributed.
GGally::ggpairs(Boston)
The correlation matrix can also be used to analyze relationships between variables. For example, there is a negative correlation between nox and dis, and a positive correlation between tax and nox.
cor_matrix=cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
We can now scale the data.
boston_scale=data.frame(scale(Boston))
summary(boston_scale)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
boston_scale[1:5,1:5]
## crim zn indus chas nox
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581
Boston[1:5,1:5]
## crim zn indus chas nox
## 1 0.00632 18 2.31 0 0.538
## 2 0.02731 0 7.07 0 0.469
## 3 0.02729 0 7.07 0 0.469
## 4 0.03237 0 2.18 0 0.458
## 5 0.06905 0 2.18 0 0.458
We will also create a new categorical variable based on the quantiles, splitting crim into 4 categories: low, medium_low, medium_high and high.
breaks=quantile(boston_scale$crim)
breaks
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime_scaled_cat =cut(boston_scale$crim, breaks = breaks, include.lowest = TRUE, labels = c("low", "medium_low", "medium_high", "high"))
table(crime_scaled_cat)
## crime_scaled_cat
## low medium_low medium_high high
## 127 126 126 127
temp=data.frame(boston_scale$crim,crime_scaled_cat)
temp[sample(1:length(boston_scale$crim),10,replace=FALSE),]
## boston_scale.crim crime_scaled_cat
## 307 -0.41137883 low
## 482 0.24352095 high
## 491 -0.39598276 medium_low
## 39 -0.39975069 medium_low
## 362 0.02596236 high
## 67 -0.41501074 low
## 125 -0.40865141 medium_low
## 223 -0.34760773 medium_high
## 481 0.25698714 high
## 489 -0.40256297 medium_low
boston_scale$crim=NULL
boston_scale$crime_scaled_cat=crime_scaled_cat
Now we divide the dataset into training and test sets; the training set comprises 80% of the data.
training_set=sample(nrow(boston_scale), nrow(boston_scale)*.8)
train=boston_scale[training_set,]
test=boston_scale[-training_set,]
The linear discriminant analysis can now be fit on the training data; LD1 explains about 95% of the between-group variance (see the proportion of trace below). LDA is a classification and dimensionality reduction method that fits linear combinations of variables to separate the target variable, which can be binary or multi-class.
lda_training=lda(crime_scaled_cat~., train)
lda_training
## Call:
## lda(crime_scaled_cat ~ ., data = train)
##
## Prior probabilities of groups:
## low medium_low medium_high high
## 0.2500000 0.2425743 0.2450495 0.2623762
##
## Group means:
## zn indus chas nox rm
## low 0.86487545 -0.8990743 -0.155385496 -0.8389011 0.4558336
## medium_low -0.07575093 -0.2774956 0.008892378 -0.5556718 -0.1492985
## medium_high -0.38156320 0.1692037 0.165126514 0.3212364 0.2051377
## high -0.48724019 1.0170298 -0.123759247 1.0511511 -0.4286567
## age dis rad tax ptratio black
## low -0.8334632 0.7985302 -0.6805407 -0.7420261 -0.39791957 0.3753579
## medium_low -0.3282985 0.3297812 -0.5470944 -0.4671799 -0.06948515 0.3152093
## medium_high 0.3603443 -0.3478542 -0.3739957 -0.2948413 -0.28973051 0.1402815
## high 0.8117090 -0.8738603 1.6390172 1.5146914 0.78181164 -0.8900971
## lstat medv
## low -0.74684432 0.516936891
## medium_low -0.12574618 0.009746835
## medium_high -0.07154987 0.292419861
## high 0.96578752 -0.742109591
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08674173 0.655178733 -0.80015979
## indus 0.03744995 -0.313652772 0.31995982
## chas -0.09669696 -0.064330933 0.13106443
## nox 0.41352402 -0.782548101 -1.42602928
## rm -0.06165354 -0.095015482 -0.24679643
## age 0.27354470 -0.290313011 -0.03121271
## dis -0.03808615 -0.294222502 0.09415368
## rad 3.01185937 0.916333257 -0.22597380
## tax -0.02962922 -0.051343716 0.64208917
## ptratio 0.09954469 0.025313275 -0.26546086
## black -0.14904020 0.006966262 0.09945314
## lstat 0.23359305 -0.172953148 0.36480426
## medv 0.16687433 -0.440038019 -0.20341341
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9545 0.0331 0.0124
We can look at the biplot, where rad appears to be the most important variable in discriminating between classes. This suggests that rad is strongly associated with the crime category; the two are also highly correlated in the correlation plot above.
ggord(lda_training, train$crime_scaled_cat)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime_scaled_cat)
# plot the lda results
plot(lda_training, dimen = 2, col = classes, pch = classes)
lda.arrows(lda_training, myscale = 1)
Then we remove the outcome variable from the test set, keeping it aside to evaluate the predictions.
crime_test=test$crime_scaled_cat
test$crime_scaled_cat=NULL
This model appears to be especially good at discriminating the high class.
test_pred=predict(lda_training,test)
table(correct = crime_test, predicted = test_pred$class)
## predicted
## correct low medium_low medium_high high
## low 21 5 0 0
## medium_low 6 21 1 0
## medium_high 1 9 17 0
## high 0 0 0 21
This model has an accuracy of 78%!
library(caret)
con_table=confusionMatrix(data=test_pred$class, reference = crime_test)
con_table
## Confusion Matrix and Statistics
##
## Reference
## Prediction low medium_low medium_high high
## low 21 6 1 0
## medium_low 5 21 9 0
## medium_high 0 1 17 0
## high 0 0 0 21
##
## Overall Statistics
##
## Accuracy : 0.7843
## 95% CI : (0.6919, 0.8596)
## No Information Rate : 0.2745
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7112
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: low Class: medium_low Class: medium_high
## Sensitivity 0.8077 0.7500 0.6296
## Specificity 0.9079 0.8108 0.9867
## Pos Pred Value 0.7500 0.6000 0.9444
## Neg Pred Value 0.9324 0.8955 0.8810
## Prevalence 0.2549 0.2745 0.2647
## Detection Rate 0.2059 0.2059 0.1667
## Detection Prevalence 0.2745 0.3431 0.1765
## Balanced Accuracy 0.8578 0.7804 0.8081
## Class: high
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.2059
## Detection Rate 0.2059
## Detection Prevalence 0.2059
## Balanced Accuracy 1.0000
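As a sanity check, the overall accuracy reported by confusionMatrix() is just the proportion of correct classifications: the sum of the confusion table's diagonal divided by its total. A minimal sketch with a hypothetical 2x2 table (the counts are made up for illustration):

```r
# Accuracy from a confusion table: correct predictions / all predictions.
# Hypothetical counts; rows are the true class, columns the predicted class.
tab <- matrix(c(21, 5,
                 6, 21),
              nrow = 2, byrow = TRUE,
              dimnames = list(truth = c("low", "high"),
                              pred  = c("low", "high")))
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 4)  # 42 correct out of 53
```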
Now we can look at k-means. The data is reloaded and rescaled, and the distance between every pair of observations is calculated using Euclidean distance (the default).
data(Boston)
boston_scale=scale(Boston)
boston_scale=as.data.frame(boston_scale)
dist_boston_eucl=dist(boston_scale)
summary(dist_boston_eucl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
A preliminary run is done first just to get a feel for k-means. We then compute k-means for a range of k values to produce a scree plot of the total within-cluster sum of squares (TWCSS), which drops most sharply between 1 and 2 clusters. I am leaning towards 2 clusters, but I will also check the average silhouette score to confirm the choice.
km_euc = kmeans(dist_boston_eucl, centers = 4)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_boston_eucl, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
silhouette_score <- function(k){
km <- kmeans(dist_boston_eucl, centers = k)
ss <- silhouette(km$cluster, dist_boston_eucl)
mean(ss[, 3])
}
k <- 2:10
avg_sil <- sapply(k, silhouette_score)
plot(k, type='b', avg_sil, xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)
Using 2 clusters, we recalculate k-means.
km_euc_2 = kmeans(dist_boston_eucl, centers = 2)
boston_scale$km_euc_2=km_euc_2$cluster
boston_scale[1:5,]
## crim zn indus chas nox rm age
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824 0.3668034
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## dis rad tax ptratio black lstat medv
## 1 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990 0.1595278
## 2 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159 -0.4919525 -0.1014239
## 3 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324 1.3229375
## 4 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708 1.1815886
## 5 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866 1.4860323
## km_euc_2
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
pairs(boston_scale[1:5], col = boston_scale$km_euc_2)
We redo the same on the whole rescaled dataset, this time with 3 clusters, and refit the LDA using the k-means cluster assignments as the target. In this model, LD1 explains about 91% of the between-group variance.
set.seed(123)
data("Boston")
boston_scale=scale(Boston)
boston_scale=as.data.frame(boston_scale)
km=kmeans(boston_scale, centers = 3)
boston_b =boston_scale %>%
mutate(crim = as.factor(km$cluster))
training_index=sample(nrow(boston_b),size=nrow(boston_b)*0.8)
train1=boston_b[training_index,]
test1=boston_b[-training_index,]
lda.fit4=lda(crim ~ ., data = train1)
lda.fit4
## Call:
## lda(crim ~ ., data = train1)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2648515 0.4108911 0.3242574
##
## Group means:
## zn indus chas nox rm age
## 1 -0.4872402 1.04730816 0.02203357 1.04361007 -0.3235550 0.7718928
## 2 -0.4098805 -0.01560851 0.17830289 0.02207459 -0.2182658 0.3629085
## 3 0.9031573 -0.95171033 -0.12205807 -0.94944506 0.5913865 -1.0936979
## dis rad tax ptratio black lstat
## 1 -0.8347830 1.5554895 1.4906517 0.7336867 -0.6683009 0.77463317
## 2 -0.2418567 -0.5695300 -0.4734350 -0.1160846 0.2441335 0.07308108
## 3 1.0775158 -0.5970032 -0.6727432 -0.4522967 0.3488940 -0.85317921
## medv
## 1 -0.61489314
## 2 -0.08812743
## 3 0.63096726
##
## Coefficients of linear discriminants:
## LD1 LD2
## zn -0.11135414 0.404232974
## indus -0.24097333 -0.169503286
## chas -0.03448550 -0.103442454
## nox -0.25398584 0.123736788
## rm 0.09113006 0.416221698
## age -0.10619045 -0.917270414
## dis 0.18493191 0.493827453
## rad -2.60905423 0.903652155
## tax -0.88492911 0.268708395
## ptratio -0.10172379 -0.001364374
## black 0.16721983 -0.103047554
## lstat -0.06586241 0.051311795
## medv 0.09222017 0.216163863
##
## Proportion of trace:
## LD1 LD2
## 0.9061 0.0939
Rad is the most influential variable for separating the classes, and it mainly drives class 1. Age is the next most influential and is better able to separate classes 2 and 3.
ggord(lda.fit4, train1$crim)
Look at the 3D biplot
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
crime_train=train$crime_scaled_cat
model_predictors <- dplyr::select(train, -crime_scaled_cat)
km_euc_2 = kmeans(dist(model_predictors), centers = 2)
matrix_product <- as.matrix(model_predictors) %*% lda_training$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers',color =crime_train )
I am very sick, so I will do the bare minimum.
library(readr)
human=read_csv("./data/human_data.csv")
## New names:
## Rows: 155 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Country dbl (9): ...1, edu2Ratio, labRatio, Edu.Exp, Life.Exp, GNI,
## Mat.Mor, Ado.Bir...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
summary(human)
## ...1 Country edu2Ratio labRatio
## Min. : 1.0 Length:155 Min. :0.1717 Min. :0.1857
## 1st Qu.: 39.5 Class :character 1st Qu.:0.7264 1st Qu.:0.5984
## Median : 78.0 Mode :character Median :0.9375 Median :0.7535
## Mean : 78.0 Mean :0.8529 Mean :0.7074
## 3rd Qu.:116.5 3rd Qu.:0.9968 3rd Qu.:0.8535
## Max. :155.0 Max. :1.4967 Max. :1.0380
## Edu.Exp Life.Exp GNI Mat.Mor
## Min. : 5.40 Min. :49.00 Min. : 581 Min. : 1.0
## 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198 1st Qu.: 11.5
## Median :13.50 Median :74.20 Median : 12040 Median : 49.0
## Mean :13.18 Mean :71.65 Mean : 17628 Mean : 149.1
## 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512 3rd Qu.: 190.0
## Max. :20.20 Max. :83.50 Max. :123124 Max. :1100.0
## Ado.Birth Parli.F
## Min. : 0.60 Min. : 0.00
## 1st Qu.: 12.65 1st Qu.:12.40
## Median : 33.60 Median :19.30
## Mean : 47.16 Mean :20.91
## 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :204.80 Max. :57.50
rownames(human)=human$Country
## Warning: Setting row names on a tibble is deprecated.
human=human[,3:10]
head(human)
## # A tibble: 6 × 8
## edu2Ratio labRatio Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.01 0.891 17.5 81.6 64992 4 7.8 39.6
## 2 0.997 0.819 20.2 82.4 42261 6 12.1 30.5
## 3 0.983 0.825 15.8 83 56431 6 1.9 28.5
## 4 0.989 0.884 18.7 80.2 44025 5 5.1 38
## 5 0.969 0.829 17.9 81.6 45435 6 6.2 36.9
## 6 0.993 0.807 16.5 80.9 43919 7 3.8 36.9
colnames(human)
## [1] "edu2Ratio" "labRatio" "Edu.Exp" "Life.Exp" "GNI" "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
Most variables are not normally distributed, except maybe Edu.Exp.
par(mfrow = c(2,4))
hist(human$edu2Ratio, main="edu2Ratio", xlab="")
hist(human$labRatio, main="labRatio", xlab="")
hist(human$Edu.Exp, main="Edu.Exp", xlab="")
hist(human$Life.Exp, main="Life.Exp", xlab="")
hist(human$GNI, main="GNI", xlab="")
hist(human$Mat.Mor, main="Mat.Mor", xlab="")
hist(human$Ado.Birth, main="Ado.Birth", xlab="")
hist(human$Parli.F, main="Parli.F", xlab="")
summary(human)
## edu2Ratio labRatio Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Correlation plot: there are some strong correlations, for example between Edu.Exp and Life.Exp.
library(corrplot)
cor_matrix=cor(human) %>% round(digits = 2)
cor_matrix
## edu2Ratio labRatio Edu.Exp Life.Exp GNI Mat.Mor Ado.Birth Parli.F
## edu2Ratio 1.00 0.01 0.59 0.58 0.43 -0.66 -0.53 0.08
## labRatio 0.01 1.00 0.05 -0.14 -0.02 0.24 0.12 0.25
## Edu.Exp 0.59 0.05 1.00 0.79 0.62 -0.74 -0.70 0.21
## Life.Exp 0.58 -0.14 0.79 1.00 0.63 -0.86 -0.73 0.17
## GNI 0.43 -0.02 0.62 0.63 1.00 -0.50 -0.56 0.09
## Mat.Mor -0.66 0.24 -0.74 -0.86 -0.50 1.00 0.76 -0.09
## Ado.Birth -0.53 0.12 -0.70 -0.73 -0.56 0.76 1.00 -0.07
## Parli.F 0.08 0.25 0.21 0.17 0.09 -0.09 -0.07 1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
Now we run PCA on the non-standardized dataset.
pca_human_raw= prcomp(human)
sum1=summary(pca_human_raw)
pca_imp_raw = round(100*sum1$importance[2,], digits = 1)
pca_imp_raw
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
biplot(pca_human_raw)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
Now we run PCA on the standardized data.
human_std=scale(human)
pca_human_std =prcomp(human_std)
sum2 <- summary(pca_human_std)
sum2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.071 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.536 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.536 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
This seems more reasonable; on the raw data, 100% of the variance was attributed to PC1.
pca_pr2_std = round(100*sum2$importance[2,], digits = 1)
pca_pr2_std
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
biplot(pca_human_std,cex = c(0.5, 0.5),)
The results of PCA on standardized vs. non-standardized data are clearly different. On the raw data, PC1 put essentially 100% of the importance on GNI, simply because of its much larger scale. On the standardized data, the education variables load in the opposite direction to maternal mortality and adolescent birth rate, and the first PC explains 54% and the second 16% of the variation in the data.
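The scale effect can be demonstrated on synthetic data: a variable measured on a much larger scale (like GNI here) dominates unstandardized PCA. A minimal base-R sketch (the column names a and b are made up):

```r
# A large-scale variable dominates unstandardized PCA.
set.seed(1)
x <- data.frame(a = rnorm(100),         # unit-scale variable
                b = rnorm(100) * 1000)  # GNI-like large-scale variable
raw <- prcomp(x)                        # no scaling: b drives PC1
std <- prcomp(x, scale. = TRUE)         # standardized: scales comparable
summary(raw)$importance[2, 1]           # PC1 proportion of variance, raw
summary(std)$importance[2, 1]           # PC1 proportion of variance, scaled
```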
Finally, we perform multiple correspondence analysis (MCA) on the tea dataset.
library(tidyr)
library(factoextra)
library(ggplot2)
library(dplyr)
library(FactoMineR)
tea = read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
tea_data= tea[, c(4,13:17)]
summary(tea_data)
## lunch Tea How sugar
## lunch : 44 black : 74 alone:195 No.sugar:155
## Not.lunch:256 Earl Grey:193 lemon: 33 sugar :145
## green : 33 milk : 63
## other: 9
## how where
## tea bag :170 chain store :192
## tea bag+unpackaged: 94 chain store+tea shop: 78
## unpackaged : 36 tea shop : 30
##
gather(tea_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will be
## dropped
mca_tea = MCA(tea_data, graph = FALSE)
mca_tea
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 6 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$var$eta2" "coord. of variables"
## 8 "$ind" "results for the individuals"
## 9 "$ind$coord" "coord. for the individuals"
## 10 "$ind$cos2" "cos2 for the individuals"
## 11 "$ind$contrib" "contributions of the individuals"
## 12 "$call" "intermediate results"
## 13 "$call$marge.col" "weights of columns"
## 14 "$call$marge.li" "weights of rows"
summary(mca_tea)
##
## Call:
## MCA(X = tea_data, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327 0.163 0.104
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695 0.735 0.314
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202 0.062 0.069
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211 0.068 0.073
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202 0.062 0.069
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202 0.062 0.069
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202 0.062 0.069
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695 0.735 0.314
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067 0.007 0.003
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650 0.643 0.261
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## lunch | 0.007 0.000 0.000 0.048 | 0.608 3.468 0.064 4.362 |
## Not.lunch | -0.001 0.000 0.000 -0.048 | -0.105 0.596 0.064 -4.362 |
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003 0.929 |
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027 2.867 |
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107 -5.669 |
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127 -6.164 |
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035 3.226 |
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020 2.422 |
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102 5.534 |
## No.sugar | 0.247 1.877 0.065 4.412 | 0.026 0.023 0.001 0.473 |
## Dim.3 ctr cos2 v.test
## lunch 0.804 7.204 0.111 5.764 |
## Not.lunch -0.138 1.238 0.111 -5.764 |
## black -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.433 9.160 0.338 10.053 |
## green -0.108 0.098 0.001 -0.659 |
## alone -0.113 0.627 0.024 -2.655 |
## lemon 1.329 14.771 0.218 8.081 |
## milk 0.013 0.003 0.000 0.116 |
## other -2.524 14.526 0.197 -7.676 |
## No.sugar -0.561 12.347 0.336 -10.026 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## lunch | 0.000 0.064 0.111 |
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## sugar | 0.065 0.001 0.336 |
## how | 0.708 0.522 0.010 |
## where | 0.702 0.681 0.055 |
fviz_screeplot(mca_tea, addlabels = TRUE)
From the scree plot we can see that the first four dimensions explain about 52% of the variability. In the biplot, distance reflects similarity: points that are closer together are more similar.
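That cumulative percentage can be recomputed directly from the eigenvalues printed in the MCA summary above (in FactoMineR they are also available via mca_tea$eig). A minimal sketch using those values:

```r
# Cumulative % of variance from the MCA eigenvalues printed above.
eig <- c(0.279, 0.261, 0.219, 0.189, 0.177, 0.156,
         0.144, 0.141, 0.117, 0.087, 0.062)
cum_pct <- cumsum(eig) / sum(eig) * 100
round(cum_pct[4], 1)  # ~51.7% explained by the first four dimensions
```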
plot(mca_tea,invisible=c("ind"))